CORRELATION IN POLYNOMIAL REGRESSION*


Ralph  A.  Bradley  and  Sushil  S.  Srivastava** 


This paper considers the effects of centering on correlation and ill conditioning
in polynomial regression. Standard statistical texts on regression give little
attention to the computational problems involved. Recent articles on regression,
and they are numerous, sometimes hint at the need for centering and scaling of the
independent variables but do not give detail. It is hoped that the demonstration
given below may dramatize the need for serious attention to centering in polynomial
regression. While it is difficult to believe that this demonstration is new, we
have not seen it before and believe that it merits attention.

Marquardt and Snee [3], in commenting on common practices, write:

"One common practice we note is failure to remove nonessential ill conditioning
through the use of standardized predictor variables. ...

The ill conditioning that results from failure to standardize is all the
more insidious because it is not due to any real defect in the data, but
only to the arbitrary origins of the scales on which the predictor variables
are expressed. ... In a linear model centering removes the correlation
between the constant term and all linear terms. In addition, in a
quadratic model centering reduces, and in certain situations completely
removes, the correlation between linear and quadratic terms."

Brown  [1]  considers  centering  and  scaling  but  places  emphasis  on  the  use  of  ridge 
regression  to  avoid  centering.  Hocking  [2],  in  an  excellent  major  review  of  analysis 
and  selection  of  variables  in  linear  regression,  does  not  address  the  centering 
problem. 

Before the easy availability of computers and regression programs, it was
standard practice to "code" the variables to simplify the computational problems
of regression. Thus the centering problem did not arise. But regression programs
seemed to eliminate computational difficulties, and their naive use, particularly
by computationally unsophisticated users, seems likely to have led to misuse, with
failure to center being a common problem.

The  purpose  of  this  note  is  to  highlight  a basic  data  analytic  problem  which 
can  arise  in  fitting  a polynomial  regression  equation  to  an  observation  matrix  when 
the  independent  variables  are  not  centered.  It  suffices  to  consider  a quadratic 
model  in  one  independent  variable  although  the  same  problem  occurs  with  more  complex 
polynomials. 

Consider the regression model,

$$E(Y) = \beta_0 x_0 + \beta_1 x_1 + \beta_2 x_2, \tag{1}$$

where $E(Y)$ is the expectation of a random variable $Y$ dependent on selected values
of variables $x_0$, $x_1$ and $x_2$. Least squares estimation of the regression coefficients
$\beta_0$, $\beta_1$ and $\beta_2$ is desired from $N$ observation vectors,
$(y_\alpha, x_{0\alpha}, x_{1\alpha}, x_{2\alpha})$, $\alpha = 1, \ldots, N$. Let $X$ be the
$N \times 3$ matrix with $\alpha$-th row $(x_{0\alpha}, x_{1\alpha}, x_{2\alpha})$. It is
well known that the estimation depends on $(X'X)^{-1}$ and that $X'X$ will be singular
(nearly singular or ill conditioned) if the sample correlation between $x_1$ and $x_2$
is unity (near unity) in absolute value.
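
The link between a near-unit correlation and ill conditioning of $X'X$ can be seen numerically. The following is a minimal sketch, not part of the original development: the data, the helper name `xtx_condition`, and the use of the 2-norm condition number as the measure of ill conditioning are all assumptions made for illustration.

```python
import numpy as np

def xtx_condition(x1, x2):
    """2-norm condition number of X'X for X with columns (1, x1, x2)."""
    X = np.column_stack([np.ones_like(x1), x1, x2])
    return np.linalg.cond(X.T @ X)

# Hypothetical data: 21 equally spaced values for x1.
x1 = np.linspace(-1.0, 1.0, 21)

# w is uncorrelated with x1 here because the x1 values are symmetric about 0.
w = x1**2 - np.mean(x1**2)

x2_far = w                 # sample correlation with x1 is essentially 0
x2_near = x1 + 0.01 * w    # sample correlation with x1 is very close to 1

cond_far = xtx_condition(x1, x2_far)
cond_near = xtx_condition(x1, x2_near)

print("corr(x1, x2_far)  =", np.corrcoef(x1, x2_far)[0, 1])
print("corr(x1, x2_near) =", np.corrcoef(x1, x2_near)[0, 1])
print("cond(X'X), far    =", cond_far)
print("cond(X'X), near   =", cond_near)
```

In this sketch the condition number grows by several orders of magnitude when the two predictor columns become nearly collinear, which is the situation the remainder of the note traces to a poor choice of origin.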


The quadratic model is

$$E(Y) = \beta_0 + \beta_1 x + \beta_2 x^2, \tag{2}$$

and it is a special case of (1) with $x_0 = 1$, $x_1 = x$, $x_2 = x^2$. Here it is the
sample correlation between $x$ and $x^2$ that is of interest. Computer programs
permit direct use of (1) and, in polynomial regression, the programs are used with
the specifications needed to obtain (2), often without concern for the centering of
$x$. We investigate the sample correlation between $x$ and $x^2$ when $x$ is not centered.


Let $x = x' + \bar{x}$, $\bar{x} = \frac{1}{N} \sum_\alpha x_\alpha$: $x'$ is centered at zero, $x$ is
centered at $\bar{x}$, which may be any value depending on the choice of origin for $x$. Let
the three pertinent central sample moments of $x$ be
$m_k = \frac{1}{N} \sum_\alpha (x'_\alpha)^k$, $k = 2, 3, 4$. It is easy to show that the sample
variances of $x$ and $x^2$ are $m_2$ and $(m_4 - m_2^2 + 4m_3\bar{x} + 4m_2\bar{x}^2)$
respectively and that the sample covariance between $x$ and $x^2$ is $m_3 + 2m_2\bar{x}$.
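
The variance and covariance expressions follow from expanding $x^2$ in terms of the centered variable. A brief sketch of the steps, writing $\operatorname{Var}$ and $\operatorname{Cov}$ for the sample variance and covariance taken with the same divisor as the moments $m_k$ (the divisor is an assumption here and cancels in the correlation):

```latex
\begin{align*}
x^2 &= (x' + \bar{x})^2 = x'^2 + 2\bar{x}\,x' + \bar{x}^2, \\
\operatorname{Var}(x) &= \operatorname{Var}(x') = m_2, \\
\operatorname{Var}(x^2) &= \operatorname{Var}(x'^2)
   + 4\bar{x}\,\operatorname{Cov}(x'^2, x')
   + 4\bar{x}^2\,\operatorname{Var}(x') \\
 &= (m_4 - m_2^2) + 4m_3\bar{x} + 4m_2\bar{x}^2, \\
\operatorname{Cov}(x, x^2) &= \operatorname{Cov}(x', x'^2)
   + 2\bar{x}\,\operatorname{Var}(x') = m_3 + 2m_2\bar{x}.
\end{align*}
```

The steps use only that $x'$ has mean zero, so that $\operatorname{Var}(x'^2) = m_4 - m_2^2$ and $\operatorname{Cov}(x', x'^2) = m_3$.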

The desired sample correlation coefficient between $x$ and $x^2$ is

$$r(\bar{x}) = (m_3 + 2m_2\bar{x}) / \{m_2(m_4 - m_2^2 + 4m_3\bar{x} + 4m_2\bar{x}^2)\}^{1/2}. \tag{3}$$
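
Expression (3) can be checked against a directly computed correlation. The sketch below assumes NumPy and a small, deliberately asymmetric sample invented for illustration; the function name `r_formula` is hypothetical.

```python
import numpy as np

def r_formula(x):
    """Correlation between x and x^2 computed from central moments, as in (3)."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    d = x - xbar                                   # x' = x - xbar, centered at zero
    m2, m3, m4 = (np.mean(d**k) for k in (2, 3, 4))
    num = m3 + 2.0 * m2 * xbar
    den = np.sqrt(m2 * (m4 - m2**2 + 4.0 * m3 * xbar + 4.0 * m2 * xbar**2))
    return num / den

# Hypothetical sample, not taken from the paper.
x = np.array([0.5, 1.0, 1.5, 2.5, 4.0, 7.0])

r_from_moments = r_formula(x)
r_direct = np.corrcoef(x, x**2)[0, 1]              # Pearson correlation of x and x^2

print(r_from_moments, r_direct)
```

The two values should agree up to floating-point rounding, whichever divisor is used for the moments, since the divisor cancels in the ratio.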


It is clear that $r(\bar{x}) \to \pm 1$ as $\bar{x} \to \pm\infty$. Choice of $\bar{x} = -m_3/2m_2$ gives
$r(\bar{x}) = 0$ and is a choice of origin equivalent to the use of orthogonal polynomials,
to degree two in this case, in the description of the model (2).
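
Both statements are easy to illustrate numerically. The following sketch uses a hypothetical asymmetric sample and shifts its origin three ways: to the value $-m_3/2m_2$ that removes the correlation, and far to either side of the data. The shift of 1000 units is an arbitrary choice for illustration.

```python
import numpy as np

def corr_x_x2(x):
    """Sample correlation between x and x^2."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x, x**2)[0, 1]

# Hypothetical sample, not taken from the paper.
x = np.array([0.5, 1.0, 1.5, 2.5, 4.0, 7.0])
d = x - x.mean()                       # centered values x'
m2, m3 = np.mean(d**2), np.mean(d**3)

# Choose the origin so that the new mean equals -m3 / (2 m2).
r_orth = corr_x_x2(d - m3 / (2.0 * m2))

# Choose origins far from the data, so the mean is large in absolute value.
r_far_pos = corr_x_x2(d + 1000.0)
r_far_neg = corr_x_x2(d - 1000.0)

print("origin at -m3/(2 m2):", r_orth)
print("mean = +1000:        ", r_far_pos)
print("mean = -1000:        ", r_far_neg)
```

Note that for an asymmetric sample the zero-correlation origin is not the mean itself; centering at the mean gives exactly zero correlation only when $m_3 = 0$.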

Let $\bar{x} = cs$, $s^2 = m_2$. Then, from (3),

$$r(c) = (m_3 + 2s^3c) / \{s^2(m_4 - s^4 + 4m_3sc + 4s^4c^2)\}^{1/2}. \tag{4}$$


A plot of $r^2(c)$ is as follows:

[Plot of $r^2(c)$ against $c$ not reproduced.]

On the plot, $c_0 = -m_3/2s^3$ and $a$ and $b$ are inflection points where the second
derivative of $r^2(c)$ with respect to $c$ changes sign, the values respectively being

$$\frac{-m_3 - \sqrt{(m_2m_4 - m_2^3 - m_3^2)/3}}{2s^3} \quad \text{and} \quad \frac{-m_3 + \sqrt{(m_2m_4 - m_2^3 - m_3^2)/3}}{2s^3}.$$
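
The location of the inflection points can be recovered from (4) in a few lines. In the sketch below, $u$ and $D$ are shorthand introduced here for convenience and do not appear in the original; it is assumed that $D > 0$, which holds unless the data are degenerate.

```latex
\begin{align*}
u &= m_3 + 2s^3c, \qquad D = m_2m_4 - m_2^3 - m_3^2, \\
r^2(c) &= \frac{u^2}{u^2 + D}, \\
\frac{d^2 r^2}{dc^2} &= (2s^3)^2\,\frac{d^2 r^2}{du^2}
   = (2s^3)^2\,\frac{2D\,(D - 3u^2)}{(u^2 + D)^3}.
\end{align*}
```

The second derivative therefore changes sign where $u = \pm\sqrt{D/3}$, that is, at $c = (-m_3 \pm \sqrt{D/3})/2s^3$, and $r^2(c)$ attains its minimum of zero at $u = 0$, that is, at $c_0 = -m_3/2s^3$.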

The choice of the $N$ values of $x$ should be open to the experimenter. Thus,
if they are chosen in a symmetric fashion defined to mean that $m_3 = 0$ and if $x$
is centered, $c = 0$, $r(0) = 0$, and the assertion of Marquardt and Snee that correlation
may be completely removed in certain circumstances is justified.

Suppose that values of $x$ are chosen so that they form a moment pattern,
$m_3 = 0$, $m_4 = 3s^4$, like those of a normal variate. Then $r^2(c) = 1/(1 + 1/(2c^2))$ and
$r^2(c) = .67, .89, .95$ as $c = 1, 2, 3$. Since $c$ is measured in terms of the standard
deviation of $x$, $|r(c)|$ is seen to be close to unity for even relatively
small values of $c$ and a choice of origin for $x$ far from $\bar{x}$ in units of standard
deviations of $x$ is most unfortunate. More extreme results follow if the values
of $x$ are chosen in a uniform pattern over an interval with a moment pattern,
$m_3 = 0$, $m_4 = 1.8s^4$, like those of a uniform variate. Then $r^2(c) = 1/(1 + 1/(5c^2))$
and $r^2(c) = .83, .95, .98$ as $c = 1, 2, 3$.
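
The quoted values can be reproduced directly from (4). With $m_3 = 0$ and $m_4 = \kappa s^4$, (4) reduces to $r^2(c) = 4c^2/(\kappa - 1 + 4c^2)$; the symbol $\kappa$ and the names in the sketch below are introduced here for illustration only.

```python
def r_squared(c, kappa):
    """r^2(c) from (4) when m3 = 0 and m4 = kappa * s^4."""
    return 4.0 * c**2 / (kappa - 1.0 + 4.0 * c**2)

# kappa = 3 matches the normal moment pattern, kappa = 1.8 the uniform one.
patterns = {"normal": 3.0, "uniform": 1.8}

table = {
    name: [round(r_squared(c, kappa), 2) for c in (1, 2, 3)]
    for name, kappa in patterns.items()
}

for name, values in table.items():
    print(f"{name:8s} r^2(c) at c = 1, 2, 3: {values}")
```

Smaller values of $\kappa$ (flatter, shorter-tailed designs) give larger $r^2(c)$ at the same $c$, which is why the uniform pattern is the more extreme of the two cases.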

The authors encountered the problem addressed in this paper in some preliminary
analysis of a weather modification experiment that involved the fitting of polynomials
in two variables with the variables badly centered. The simple analysis
given above demonstrates clearly that failure to center the independent variables in
polynomial regression can lead to correlations near unity and to ill conditioning.
Choices of origins far from centers can be disastrous and it is clearly safest to
center the independent variables at their means. The development given is easy and
could provide an effective problem assignment in a first course on regression. The
development is of sufficient importance to require the introduction of cautions
in discussions of regression in texts.